Finding Community Structure Based on 
Subgraph Similarity 



Biao Xiang, En-Hong Chen, and Tao Zhou 



Abstract Community identification is a long-standing challenge in modern network
science, especially for very large-scale networks containing millions of nodes.
In this paper, we propose a new metric to quantify the structural similarity
between subgraphs, based on which an algorithm for community identification is
designed. Extensive empirical results on several real networks from disparate fields
demonstrate that the present algorithm provides the same level of reliability,
measured by modularity, while taking much less time than the well-known fast
algorithm proposed by Clauset, Newman and Moore (CNM). We further propose
a hybrid algorithm that simultaneously enhances modularity and saves
computational time compared with the CNM algorithm.



1 Introduction 

The study of complex networks has become a common focus of many branches of
science [1]. An open problem that attracts increasing attention is the identification
and analysis of communities [2]. Communities can be loosely defined as distinct
subsets of nodes that are densely connected internally and only sparsely connected
to each other [3]. Knowledge of community structure is significant for
understanding network evolution [4] and the dynamics taking place on networks,
such as epidemic spreading [5,6] and synchronization [7,8]. In addition, reasonable
identification of communities helps enhance the accuracy of information
filtering and recommendation [9].

Many algorithms for community identification have been proposed. These include
the agglomerative method based on node similarity [10], a divisive method via
iterative removal of the edge with the highest betweenness [11], a divisive method
based on a dissimilarity index between nearest-neighboring nodes [12], a local
algorithm based on the edge-clustering coefficient [13], the Potts model for fuzzy
community detection [14], simulated annealing [15], extremal optimization [16],
a spectrum-based algorithm [17], an iterative algorithm based on message passing
[18], and so on.

Finding the optimal division of communities, measured by modularity [11],
is very hard [19], and in most cases we can only obtain a near-optimal division.
Generally speaking, without any prior knowledge, such as the maximal community
size or the number of communities, an algorithm that gives higher modularity
is more time-consuming [20]. As a consequence, providing an accurate division of
communities for a very large-scale network in reasonable time is a big challenge in
modern network science. To address this issue, Newman proposed a fast greedy
algorithm with time complexity $O(n^2)$ for sparse networks [21], where $n$ denotes
the number of nodes. Furthermore, Clauset, Newman, and Moore (CNM) designed an
improved algorithm giving identical results but with lower computational complexity,
$O(n \log^2 n)$ [22]. In this paper, based on a newly proposed metric of similarity
between subgraphs, we design an agglomerative algorithm for community
identification, which gives the same level of reliability but is typically hundreds
of times faster than the CNM algorithm. We further propose a hybrid method that
simultaneously enhances modularity and saves computational time compared with the
CNM algorithm.
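
For reference, the modularity used as the quality measure throughout can be written
in the following commonly used form. This is a sketch under stated assumptions, not a
formula taken from this paper: $m$ is the total number of edges, $h$ the number of
communities, $d_i$ the sum of the degrees of the nodes in community $i$, and $l_i$
(a symbol introduced here only for illustration) the number of edges with both
endpoints inside community $i$.

```latex
Q = \sum_{i=1}^{h} \left[ \frac{l_i}{m} - \left( \frac{d_i}{2m} \right)^{2} \right]
```

The first term is the fraction of edges that fall inside community $i$, and the
second is the fraction expected if edges were placed at random while keeping the
node degrees, so a larger $Q$ indicates a division with more internal edges than
chance would give.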

The rest of this paper is organized as follows. In Section 2, we introduce the
present method, including the new metric of subgraph similarity and the
corresponding algorithm, as well as the hybrid algorithm. In Section 3, we give a
brief description of the empirical data used in this paper. The performance of our
proposed algorithms, in terms of both accuracy and computational time, is presented
in Section 4. Finally, we sum up this paper in Section 5.



2 Method 

Consider an undirected simple network $G(V,E)$, where $V$ is the set of nodes
and $E$ is the set of edges. Multiple edges and self-connections are not allowed.
Denote by $\Gamma = \{V_1, V_2, \ldots, V_h\}$ a division of $G$, that is,
$V_i \cap V_j = \emptyset$ for $1 \le i \ne j \le h$ and
$V_1 \cup V_2 \cup \cdots \cup V_h = V$. We here propose a new metric of similarity
between two subgraphs, $V_i$ and $V_j$, as:

$$ s_{ij} = \frac{e_{ij} + \sum_{k=1}^{h} \frac{e_{ik} e_{kj}}{|V_k|}}{\sqrt{d_i d_j}}, \qquad (1) $$

where $e_{ij}$ is the number of edges with one endpoint in $V_i$ and the other in
$V_j$ ($e_{ij}$ is defined to be zero if $i = j$), $|V_k|$ is the number of nodes in
subgraph $V_k$, and $d_i = \sum_{x \in V_i} k_x$ is the sum of the degrees of the
nodes in $V_i$, where the degree of node $x$, namely $k_x$, is defined as the number
of edges adjacent to $x$ in $G(V,E)$. The similarity here can be considered a
measure of proximity between subgraphs: two subgraphs having more connections, or
being simultaneously closely connected to some other subgraphs, are supposed to have
higher proximity to each other. $d_i$ can be considered the mass of a subgraph, and
the denominator, $\sqrt{d_i d_j}$, is introduced to reduce the bias induced by the
inequality of subgraph sizes. Note that if each subgraph contains only a single
node, $V_i = \{v_i\}$, the similarity between two subgraphs, $V_i$ and $V_j$,
degenerates to the well-known Salton index (also called cosine similarity in the
literature) [23] between $v_i$ and $v_j$ if they are not directly connected.
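
As an illustration of the similarity defined above, the following minimal Python
sketch evaluates it by brute force for every pair of subgraphs in a given division.
It assumes the metric has the form $s_{ij} = (e_{ij} + \sum_k e_{ik} e_{kj} / |V_k|) / \sqrt{d_i d_j}$;
the function name `subgraph_similarity`, the edge-list input format, and the
convention of returning 0 for a subgraph without edges are assumptions made for this
example, and the cubic loop is meant for clarity rather than for large networks.

```python
from math import sqrt


def subgraph_similarity(edges, partition):
    """Brute-force evaluation of the subgraph similarity for every pair.

    edges     -- iterable of undirected edges (x, y) of a simple graph
    partition -- list of disjoint node sets V_1, ..., V_h covering all nodes
    Returns a dict {(i, j): s_ij} for 0 <= i < j < h (0-based indices).
    """
    label = {x: i for i, nodes in enumerate(partition) for x in nodes}
    h = len(partition)
    e = [[0] * h for _ in range(h)]  # e[i][j]: edges between V_i and V_j
    d = [0] * h                      # d[i]: sum of node degrees in V_i
    for x, y in edges:
        i, j = label[x], label[y]
        d[i] += 1
        d[j] += 1
        if i != j:                   # e_ii stays zero by definition
            e[i][j] += 1
            e[j][i] += 1

    sim = {}
    for i in range(h):
        for j in range(i + 1, h):
            # Contribution of subgraphs that are linked to both V_i and V_j.
            shared = sum(e[i][k] * e[k][j] / len(partition[k]) for k in range(h))
            mass = sqrt(d[i] * d[j])
            # Assumption: a subgraph without any edge gets similarity 0.
            sim[i, j] = (e[i][j] + shared) / mass if mass else 0.0
    return sim


# Path 0 - 1 - 2 with every node in its own subgraph.
sim = subgraph_similarity([(0, 1), (1, 2)], [{0}, {1}, {2}])
print(sim)
```

In this single-node example the two end nodes are not directly connected and share
one neighbor, so their similarity reduces to the Salton index, while the similarity
of two adjacent nodes also counts the direct edge in the numerator.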

Our algorithm starts from an $n$-division $\Gamma_0 = \{V_1, V_2, \ldots, V_n\}$
with $V_i = \{v_i\}$ for $1 \le i \le n$. The procedure is as follows. (i) For each
subgraph $V_i$, let it connect to its most similar subgraphs, namely
$\{V_j \mid s_{ij} = \max_k \{s_{ik}\}\}$. (ii) Merge each connected component in
the network of subgraphs generated by step (i) into one subgraph, which defines the
next division. (iii) Repeat from step (i) until the number of subgraphs equals one.
During this procedure, we calculate the modularity of each division and record the
division with the maximal modularity. To make the algorithm clear, we show a
small-scale example consisting of six subgraphs with similarity matrix:

$$ S = \begin{pmatrix}
0 & 2 & 2 & 1 & 1 & 1 \\
2 & 0 & 1 & 3 & 1 & 1 \\
2 & 1 & 0 & 1 & 0 & 1 \\
1 & 3 & 1 & 0 & 2 & 0 \\
1 & 1 & 0 & 2 & 0 & 3 \\
1 & 1 & 1 & 0 & 3 & 0
\end{pmatrix}. \qquad (2) $$



After step (i), as shown in Figure 1, we get a network where each node represents
a subgraph. We use the directed network representation, in which a directed arc
from $V_i$ to $V_j$ means $V_j$ is one of the most similar subgraphs to $V_i$. In
the algorithmic implementation, those directed arcs can be treated as undirected
(symmetric) edges. The network shown in Figure 1 is determined by the similarity
matrix $S$, and after step (ii), the updated division contains only two subgraphs,
$V_1 \cup V_2 \cup V_3 \cup V_4$ and $V_5 \cup V_6$, corresponding to the two
connected components. Note that the algorithmic procedure is deterministic, and the
result is therefore not sensitive to where it starts.
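
One round of steps (i) and (ii) can be sketched in a few lines of Python. The sketch
below is an illustration only: the function name `merge_round` and the union-find
bookkeeping are choices made for this example, the matrix is assumed to be the
symmetric six-by-six example with zero diagonal discussed above, and indices are
0-based, so index 0 stands for $V_1$. It also assumes at least two subgraphs and
compares similarities with exact equality, which is fine for the integer example but
may need a tolerance for floating-point similarities.

```python
def merge_round(sim):
    """Link each subgraph to its most similar subgraphs (step i), then
    return the connected components as sorted lists of indices (step ii)."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best = max(sim[i][j] for j in range(n) if j != i)
        for j in range(n):
            if j != i and sim[i][j] == best:
                parent[find(i)] = find(j)  # arc i -> j treated as undirected

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


S = [
    [0, 2, 2, 1, 1, 1],
    [2, 0, 1, 3, 1, 1],
    [2, 1, 0, 1, 0, 1],
    [1, 3, 1, 0, 2, 0],
    [1, 1, 0, 2, 0, 3],
    [1, 1, 1, 0, 3, 0],
]
components = merge_round(S)
print(components)  # [[0, 1, 2, 3], [4, 5]]
```

The two components correspond to $V_1 \cup V_2 \cup V_3 \cup V_4$ and
$V_5 \cup V_6$. Because every subgraph links to at least one other subgraph, each
round at least halves the number of subgraphs, which is consistent with the short
running times reported for the algorithm.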



Fig. 1 Illustration of the algorithm procedure, where each node represents a subgraph. The
similarities between subgraph pairs are shown in Eq. (2).

The CNM algorithm is relatively rough in the early stage; in fact, it strongly
tends to merge lower-degree nodes together (see Eq. (2) in Ref. [21]: the first term
is not distinguishable in the early stage, while the enhancement of the second term
favors lower-degree nodes). This tendency usually causes mistakes in the very early
stage that cannot be corrected afterwards. We therefore propose a hybrid algorithm
which starts from an $n$-division $\Gamma_0 = \{V_1, V_2, \ldots, V_n\}$ and carries
out the procedure described above for one round (i.e., step (i) and step (ii)). In
this round, the subgraph similarity degenerates to the similarity between two nodes:

$$ s_{xy} = \frac{a_{xy} + n_{xy}}{\sqrt{k_x k_y}}, \qquad (3) $$

where $n_{xy}$ denotes the number of common neighbors of $x$ and $y$, and $a_{xy}$
is 1 if $x$ and $y$ are directly connected and 0 otherwise. After this round, each
subgraph has at least two nodes. Then, we apply the CNM algorithm until all nodes
are merged together.



3 Data

In this paper, we consider five real networks drawn from disparate fields: (i)
Football: a network of American football games between Division IA colleges
during regular season Fall 2000, where nodes denote football teams and edges
represent regular season games [11]. (ii) Yeast PPI: a protein-protein interaction
network where each node represents a protein [24,25]. (iii) Cond-Mat: a network of
coauthorships between scientists posting preprints on the Condensed Matter E-Print
Archive from Jan 1995 to March 2005 [26]. (iv) WWW: a sampling network of
the World Wide Web [27]. (v) IMDB: actor networks from the Internet Movie
Database [28]. We summarize the basic information of these networks in Table 1.

Fig. 2 Comparison of the algorithmic outputs corresponding to the best identifications subject to
modularity. The three panels are (upper panel) real grouping in regular season Fall 2000, (middle
panel) resulting communities from the CNM algorithm, and (lower panel) resulting communities
from the XCZ+CNM algorithm. Each node here denotes a football team and different colors
represent different groups/communities.



Table 1 Basic information of the networks for testing.

Networks    Number of Nodes, |V|   Number of Edges, |E|   References
Football                     115                    613   [11]
Yeast PPI                   2631                   7182   [24,25]
Cond-Mat                   40421                 175693   [26]
WWW                       325729                1090107   [27]
IMDB                     1324748                3782463   [28]



Table 2 Maximal modularity.

Algorithms   Football   Yeast PPI   Cond-Mat   WWW     IMDB
CNM          0.577      0.565       0.645      0.927   N/A
XCZ          0.538      0.566       0.682      0.882   0.691
XCZ+CNM      0.605      0.590       0.716      0.932   0.786



Table 3 CPU time at millisecond (ms) resolution.

Algorithms   Football   Yeast PPI   Cond-Mat        WWW       IMDB
CNM               172        5132     559781   12304152        N/A
XCZ                            47       2022      17734     257875
XCZ+CNM                        62      36422     443907   47714093



4 Result 



In Table 2 and Table 3, we respectively report the maximal modularities and the
CPU times corresponding to the CNM algorithm, our proposed algorithm (referred
to as the XCZ algorithm, where XCZ abbreviates the authors' names), and the
hybrid algorithm (referred to as XCZ+CNM). All computations were carried out on
a desktop computer with a single Intel Core E2160 processor (1.8 GHz) and 2 GB
EMS memory. The program code for the CNM algorithm was downloaded directly
from the personal homepage of Clauset. The IMDB network seems too large for the
CNM algorithm, and we could not get a result in reasonable time.

From Table 2, one can see that the XCZ algorithm provides a competitively
accurate division of communities versus the CNM algorithm. A significant feature
of the XCZ algorithm is that it is very fast, in general more than 100 times faster
than the CNM algorithm (Table 3). With just a desktop computer, one can find the
community structure of a network containing $10^6$ nodes within minutes. In
comparison, the hybrid algorithm is remarkably more accurate (measured by the
maximal modularity) than both the CNM and XCZ algorithms. In Figure 2, we compare
the resulting community structures of the Football network, from which one can
clearly see that the hybrid algorithm gives a result closer to the real grouping
than the CNM algorithm does. We think the hybrid algorithm is fast enough for many
real applications. Taking IMDB as an example, although it contains more than
$1.3 \times 10^6$ nodes, the hybrid algorithm takes less than one day. Indeed, the
hybrid algorithm outperforms the CNM algorithm in both accuracy and speed.
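
The speed claims above can be checked with a few lines of arithmetic on the CPU
times of Table 3. The snippet below is only a convenience sketch: the dictionary
names are chosen for this example, and the Football column is left out because its
XCZ timing is not available in the table as reproduced here.

```python
# CPU times in milliseconds, copied from Table 3.
cnm_ms = {"Yeast PPI": 5132, "Cond-Mat": 559781, "WWW": 12304152}
xcz_ms = {"Yeast PPI": 47, "Cond-Mat": 2022, "WWW": 17734}

# Speedup of XCZ over CNM on each network where both timings exist.
speedup = {name: cnm_ms[name] / xcz_ms[name] for name in cnm_ms}
for name, ratio in speedup.items():
    print(f"{name}: XCZ is about {ratio:.0f} times faster than CNM")

# IMDB timings converted to more readable units.
imdb_xcz_minutes = 257875 / 60_000        # XCZ alone
imdb_hybrid_hours = 47714093 / 3_600_000  # XCZ+CNM hybrid
print(f"IMDB: XCZ {imdb_xcz_minutes:.1f} min, hybrid {imdb_hybrid_hours:.1f} h")
```

The ratios come out at roughly 109, 277 and 694 for the three networks, and the IMDB
timings correspond to about 4.3 minutes for XCZ and about 13.3 hours for the hybrid
algorithm.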



5 Conclusion 

Thanks to the rapid development of computing power and database technology,
many very large-scale networks, consisting of millions of nodes or more, are now
available to the scientific community. Analysis of such networks calls for highly
efficient algorithms, and the problem of community identification has attracted
more and more attention for its hardness and practical significance.

The agglomerative method based on node similarity [10] is of lower accuracy
compared with the divisive algorithms based on edge-betweenness [11] and the
edge-clustering coefficient [13]. In this paper, we extended the similarity
measuring the structural equivalence of a pair of nodes to the so-called subgraph
similarity, which can quantify the proximity of two subsets of nodes. Accordingly,
we designed an ultrafast algorithm, which provides a competitively accurate division
of communities while typically running hundreds of times faster than the well-known
CNM algorithm. Using our algorithm, with just a desktop computer, one can deal with
a network of millions of nodes in minutes. For example, it takes less than five
minutes to get the community structure of IMDB, which consists of more than
$1.3 \times 10^6$ nodes.

Furthermore, we integrated the CNM algorithm and our proposed algorithm
and designed a hybrid method. Numerical results on representative real networks
showed that this hybrid algorithm is remarkably more accurate than the CNM
algorithm, and can manage a network of about one million nodes in less than one day
(about 13 hours for IMDB, according to Table 3).

Modularity has been widely accepted as a standard metric for evaluating
community identification, and it has also found other applications, such as
assisting in the extraction of the hierarchical organization of complex systems
[29]. Although modularity is indeed the most popular metric for community
identification, and the result corresponding to the maximal modularity looks very
reasonable (see, for example, Figure 2), it has an intrinsic resolution limit that
makes small communities hard to detect [30,31]. An alternative, named normalized
mutual information [20], is a good candidate for future investigation. In addition,
an extension of modularity for weighted networks, namely weighted modularity [32],
has been adopted to deal with the community identification problem in weighted
networks [33,34]. We hope the subgraph similarity proposed in this paper can also be
properly extended to a weighted version to help extract weighted communities.